Introduction to Open Data Science - Course Project

Chapter 1: Introduction

I am a PhD student in Economic history and want to learn R to be able to use it in my research. The course seems somewhat challenging to me, but I am quite excited to learn these new skills!

My Github repository is https://github.com/hshirvon/IODS-project

My course diary is at: https://hshirvon.github.io/IODS-project/

date()
## [1] "Sun Dec  5 14:23:55 2021"

Chapter 2: Regression and model validation

This week I have been learning simple linear regression.

date()
## [1] "Sun Dec  5 14:23:55 2021"

1. First, I will read the students2014 data into R from my local folder and explore the structure and dimensions of the data, and also try out a few other functions.

heididata <- read.table("data/heidindata.csv", sep=",", header=TRUE)
#Explore the structure and the dimensions of the data 
str(heididata)
## 'data.frame':    166 obs. of  8 variables:
##  $ X             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sukupuoli     : chr  "F" "M" "F" "M" ...
##  $ lrn14.Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ lrn14.Attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ lrn14.deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ lrn14.stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ lrn14.surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ lrn14.Points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(heididata)
## [1] 166   8
head(heididata)
##   X sukupuoli lrn14.Age lrn14.Attitude lrn14.deep lrn14.stra lrn14.surf
## 1 1         F        53            3.7   3.583333      3.375   2.583333
## 2 2         M        55            3.1   2.916667      2.750   3.166667
## 3 3         F        49            2.5   3.500000      3.625   2.250000
## 4 4         M        53            3.5   3.500000      3.125   2.250000
## 5 5         M        49            3.7   3.666667      3.625   2.833333
## 6 6         F        38            3.8   4.750000      3.625   2.416667
##   lrn14.Points
## 1           25
## 2           12
## 3           24
## 4           10
## 5           22
## 6           21

The dataset “heididata” includes 8 variables and 166 observations. It is a subset of an international survey of Approaches to Learning, collected between 3 December 2014 and 10 January 2015. The variables included in “heididata” are gender (sukupuoli), age, attitude, deep learning, strategic learning, surface learning, exam points and a running index X.

2. Next, I will show a graphical overview of the data and show summaries of the variables in the data. I will also describe and interpret the outputs.

plot(heididata)

summary(heididata)
##        X           sukupuoli           lrn14.Age     lrn14.Attitude 
##  Min.   :  1.00   Length:166         Min.   :17.00   Min.   :1.400  
##  1st Qu.: 44.25   Class :character   1st Qu.:21.00   1st Qu.:2.600  
##  Median : 87.50   Mode  :character   Median :22.00   Median :3.200  
##  Mean   : 90.13                      Mean   :25.51   Mean   :3.143  
##  3rd Qu.:136.75                      3rd Qu.:27.00   3rd Qu.:3.700  
##  Max.   :183.00                      Max.   :55.00   Max.   :5.000  
##    lrn14.deep      lrn14.stra      lrn14.surf     lrn14.Points  
##  Min.   :1.583   Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:3.333   1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.667   Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.680   Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:4.083   3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.917   Max.   :5.000   Max.   :4.333   Max.   :33.00

The age of the respondents is between 17 and 55; attitude approximately 1.4 - 5.0; deep learning approximately 1.6 - 4.9; strategic learning approximately 1.3 - 5.0; surface learning approximately 1.6 - 4.3 and points 7 - 33.
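The same ranges can be pulled out programmatically. This is a small sketch, assuming the heididata data frame read in above:

```r
# minimum and maximum of every numeric variable at a glance
sapply(heididata[sapply(heididata, is.numeric)], range)
```

The inner `sapply()` picks out the numeric columns; the outer one applies `range()` to each of them.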

3. My regression model explains the target (dependent) variable, exam points, with gender, age and attitude.

my_regression <- lm(lrn14.Points ~ sukupuoli + lrn14.Age + lrn14.Attitude , data = heididata)

#Print out a summary of my regression model
summary(my_regression)
## 
## Call:
## lm(formula = lrn14.Points ~ sukupuoli + lrn14.Age + lrn14.Attitude, 
##     data = heididata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4590  -3.3221   0.2186   4.0247  10.4632 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    13.42910    2.29043   5.863 2.48e-08 ***
## sukupuoliM     -0.33054    0.91934  -0.360    0.720    
## lrn14.Age      -0.07586    0.05367  -1.414    0.159    
## lrn14.Attitude  3.60657    0.59322   6.080 8.34e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:  0.187 
## F-statistic: 13.65 on 3 and 162 DF,  p-value: 5.536e-08

The regression model explains the target (dependent) variable, exam points, with gender, age and attitude. Attitude has a statistically significant relationship with points, but gender and age do not.

Next, I will run the regression again with just attitude as the explanatory variable and print out a summary of the regression model:

my_regression2 <- lm(lrn14.Points ~ lrn14.Attitude , data = heididata)
summary(my_regression2)
## 
## Call:
## lm(formula = lrn14.Points ~ lrn14.Attitude, data = heididata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     11.6372     1.8303   6.358 1.95e-09 ***
## lrn14.Attitude   3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

4. Since gender and age did not have a statistically significant effect, I will analyze the summary of my_regression2, which uses attitude as the only explanatory variable.

The coefficient of attitude is 3.5255, meaning that when attitude increases by one unit, the predicted exam points increase by 3.5255 units. One unit better attitude should thus result in about 3.5 more points!

The multiple R-squared of the model is 0.1906. This means that approximately 19.1% of the variation in points is explained by the variation in attitude.
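Both interpretations can be checked numerically. This is a sketch assuming the heididata and my_regression2 objects created above: the predicted difference in points between two students whose attitudes differ by one unit equals the slope, and for a one-predictor model the multiple R-squared equals the squared Pearson correlation between predictor and response.

```r
# the slope is the predicted change in points per one-unit change in attitude
preds <- predict(my_regression2, newdata = data.frame(lrn14.Attitude = c(3, 4)))
diff(preds)  # equals the lrn14.Attitude coefficient, about 3.5255

# in simple regression, R-squared = cor(x, y)^2
cor(heididata$lrn14.Attitude, heididata$lrn14.Points)^2  # about 0.1906
```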

5. Finally, I will explore the validity of the model assumptions by producing diagnostic plots.

A linear regression model assumes linearity, and that the errors are normally distributed, uncorrelated and have constant variance, so that the size of a given error does not depend on the explanatory variables. A boxplot or probability plot of the residuals can be useful for checking symmetry and, in particular, the normality of the error terms.

Using the linear model object “my_regression” with several explanatory variables, created earlier, I will produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.

#Let's place the following graphics in the same plot:
par(mfrow = c(2,2))

#Calling the plot function with which = 1 (Residuals vs Fitted values), 2 (Normal QQ-plot) and 5 (Residuals vs Leverage)
plot(my_regression, which = c(1, 2, 5))

First, let’s look at the residuals against the fitted values of the response variable. If the variability of the residuals appears to increase with the size of the fitted values, a transformation of the response variable prior to fitting is indicated. In my case, the residuals vs. fitted values plot looks much as expected, which suggests the fitted model is appropriate and that the assumption of constant variance is justified.

The QQ-plot of the residuals provides a way to explore the assumption that the errors are normally distributed. There is a small dip at the beginning and at the end of the plot, which raises a slight worry about the normality assumption. Still, the errors follow the straight line reasonably well.

The residuals vs. leverage plot can help identify observations with an unusually high impact on the model. In my case it does not look very balanced: some observations have an unusually large impact.
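As a numeric complement to the QQ-plot, a Shapiro-Wilk test of the residuals gives a formal check of the normality assumption; a small p-value would indicate a departure from normality. This is an optional sketch, assuming the my_regression object fitted above:

```r
# Shapiro-Wilk normality test of the model residuals
shapiro.test(residuals(my_regression))
```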


Chapter 3: Logistic Regression

#Heidi Hirvonen 18.11.2020

1. First, I created the Rmd file for Chapter 3 and added it as a child file in index.Rmd.

2. Next, I will read the joined student alcohol consumption data into R from https://github.com/rsund/IODS-project/raw/master/data/alc.csv

joined_data <- read.csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv", sep=",", header=TRUE)

colnames(joined_data)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "n"          "id.p"       "id.m"      
## [31] "failures"   "paid"       "absences"   "G1"         "G2"        
## [36] "G3"         "failures.p" "paid.p"     "absences.p" "G1.p"      
## [41] "G2.p"       "G3.p"       "failures.m" "paid.m"     "absences.m"
## [46] "G1.m"       "G2.m"       "G3.m"       "alc_use"    "high_use"  
## [51] "cid"

3. Next, I will analyze the relationships between high/low alcohol consumption and some of the other variables in the data.

4. First, I will access the tidyverse libraries tidyr, dplyr and ggplot2. Then I will explore the distributions of my chosen variables and alcohol use numerically and graphically (crosstabulations, barplots and boxplots).

library(tidyr)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

glimpse(joined_data)
## Rows: 370
## Columns: 51
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex        <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
## $ age        <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,…
## $ address    <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R", "U", "U", "U",…
## $ famsize    <chr> "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "LE3", "LE…
## $ Pstatus    <chr> "T", "T", "T", "T", "T", "T", "T", "T", "T", "A", "A", "T",…
## $ Medu       <int> 1, 1, 2, 2, 3, 3, 3, 2, 3, 3, 4, 1, 1, 1, 1, 1, 2, 2, 2, 3,…
## $ Fedu       <int> 1, 1, 2, 4, 3, 4, 4, 2, 1, 3, 3, 1, 1, 1, 2, 2, 1, 2, 3, 2,…
## $ Mjob       <chr> "at_home", "other", "at_home", "services", "services", "ser…
## $ Fjob       <chr> "other", "other", "other", "health", "services", "health", …
## $ reason     <chr> "home", "reputation", "reputation", "course", "reputation",…
## $ guardian   <chr> "mother", "mother", "mother", "mother", "other", "mother", …
## $ traveltime <int> 2, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 1, 1, 1, 3, 1, 2, 1,…
## $ studytime  <int> 4, 2, 1, 3, 3, 3, 3, 2, 4, 4, 2, 1, 2, 2, 2, 2, 3, 4, 1, 2,…
## $ schoolsup  <chr> "yes", "yes", "yes", "yes", "no", "yes", "no", "yes", "no",…
## $ famsup     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ activities <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "no", "no", …
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "no"…
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet   <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes…
## $ romantic   <chr> "no", "yes", "no", "no", "yes", "no", "yes", "no", "no", "n…
## $ famrel     <int> 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 5, 5, 3, 3,…
## $ freetime   <int> 1, 3, 3, 3, 2, 3, 2, 1, 4, 3, 3, 3, 3, 4, 3, 2, 2, 1, 5, 3,…
## $ goout      <int> 2, 4, 1, 2, 1, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 1, 2,…
## $ Dalc       <int> 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1,…
## $ Walc       <int> 1, 4, 1, 1, 3, 1, 2, 3, 3, 1, 1, 2, 3, 2, 1, 2, 1, 1, 1, 1,…
## $ health     <int> 1, 5, 2, 5, 3, 5, 5, 4, 3, 4, 1, 4, 4, 5, 5, 1, 4, 3, 5, 3,…
## $ n          <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ id.p       <int> 1096, 1073, 1040, 1025, 1166, 1039, 1131, 1069, 1070, 1106,…
## $ id.m       <int> 2096, 2073, 2040, 2025, 2153, 2039, 2131, 2069, 2070, 2106,…
## $ failures   <int> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,…
## $ paid       <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no…
## $ absences   <int> 3, 2, 8, 2, 5, 2, 0, 1, 9, 10, 0, 3, 2, 0, 4, 1, 2, 6, 2, 1…
## $ G1         <int> 10, 10, 14, 10, 12, 12, 11, 10, 16, 10, 14, 10, 11, 10, 12,…
## $ G2         <int> 12, 8, 13, 10, 12, 12, 6, 10, 16, 10, 14, 6, 11, 12, 12, 14…
## $ G3         <int> 12, 8, 12, 9, 12, 12, 6, 10, 16, 10, 15, 6, 11, 12, 12, 14,…
## $ failures.p <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ paid.p     <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no…
## $ absences.p <int> 4, 2, 8, 2, 2, 2, 0, 0, 6, 10, 0, 6, 2, 0, 6, 0, 0, 4, 4, 2…
## $ G1.p       <int> 13, 13, 14, 10, 13, 11, 10, 11, 15, 10, 15, 11, 13, 12, 13,…
## $ G2.p       <int> 13, 11, 13, 11, 13, 12, 11, 10, 15, 10, 14, 12, 12, 12, 12,…
## $ G3.p       <int> 13, 11, 12, 10, 13, 12, 12, 11, 15, 10, 15, 13, 12, 12, 13,…
## $ failures.m <int> 1, 2, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,…
## $ paid.m     <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "yes", "no",…
## $ absences.m <int> 2, 2, 8, 2, 8, 2, 0, 2, 12, 10, 0, 0, 2, 0, 2, 2, 4, 8, 0, …
## $ G1.m       <int> 7, 8, 14, 10, 10, 12, 12, 8, 16, 10, 14, 8, 9, 8, 10, 16, 1…
## $ G2.m       <int> 10, 6, 13, 9, 10, 12, 0, 9, 16, 11, 15, 0, 10, 11, 11, 15, …
## $ G3.m       <int> 10, 5, 13, 8, 10, 11, 0, 8, 16, 11, 15, 0, 10, 11, 11, 15, …
## $ alc_use    <dbl> 1.0, 3.0, 1.0, 1.0, 2.5, 1.0, 2.0, 2.0, 2.5, 1.0, 1.0, 1.5,…
## $ high_use   <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE,…
## $ cid        <int> 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 3009, 3010,…
gather(joined_data) %>% glimpse
## Rows: 18,870
## Columns: 2
## $ key   <chr> "school", "school", "school", "school", "school", "school", "sch…
## $ value <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP"…
First, let’s look at high alcohol use and self-reported health. Somewhat surprisingly, mean health status is slightly higher among the high-use group (3.73) than among the rest (3.49).

#Produce summary statistics of high_use and health
joined_data %>% group_by(high_use) %>% summarise(count = n(),mean(health))
## # A tibble: 2 x 3
##   high_use count `mean(health)`
##   <lgl>    <int>          <dbl>
## 1 FALSE      259           3.49
## 2 TRUE       111           3.73
# initialize a plot of high_use and health
a1 <- ggplot(joined_data, aes(x = high_use, y = health))

# define the plot as a boxplot and draw it
a1 + geom_boxplot()+ xlab("high_use") + ylab("health") 

Second, let’s take a look at high alcohol use and absences. The connection between absences and high alcohol use is as I expected: the mean number of absences is 6.38 days for students with high alcohol use and 3.71 days for the rest.

#Produce summary statistics of high_use and absences
joined_data %>% group_by(high_use) %>% summarise(count = n(),mean(absences))
## # A tibble: 2 x 3
##   high_use count `mean(absences)`
##   <lgl>    <int>            <dbl>
## 1 FALSE      259             3.71
## 2 TRUE       111             6.38
# initialize a plot of high_use and absences
a2 <- ggplot(joined_data, aes(x = high_use, y = absences))

# define the plot as a boxplot and draw it
a2 + geom_boxplot() +  xlab("high_use") + ylab("absences")

#Produce summary statistics of family support and high alcohol use
joined_data %>% group_by(famsup) %>% summarise(count = n(),mean(high_use))
## # A tibble: 2 x 3
##   famsup count `mean(high_use)`
##   <chr>  <int>            <dbl>
## 1 no       139            0.324
## 2 yes      231            0.286
# initialize a plot of famsup and high_use
a3 <- ggplot(joined_data, aes(x = famsup, y = high_use))

# define the plot as a boxplot and draw it
a3 + geom_boxplot() +  xlab("family support") + ylab("high_use")

#Produce summary statistics of extra curricular activities and high alcohol use
joined_data %>% group_by(activities) %>% summarise(count = n(),mean(high_use))
## # A tibble: 2 x 3
##   activities count `mean(high_use)`
##   <chr>      <int>            <dbl>
## 1 no           179            0.330
## 2 yes          191            0.272
# initialize a plot of activities and high_use
a4 <- ggplot(joined_data, aes(x = activities, y = high_use))

# define the plot as a boxplot and draw it
a4 + geom_boxplot() + xlab("activities")+ ylab("high_use")

5. Next, I am using logistic regression to statistically explore the relationship between my chosen variables and the binary high/low alcohol consumption variable as the target variable. I will present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them.

#regression
m <- glm(high_use ~ health + absences + famsup + activities, data = joined_data , family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ health + absences + famsup + activities, 
##     family = "binomial", data = joined_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5664  -0.8251  -0.7171   1.2578   1.9729  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.48808    0.39040  -3.812 0.000138 ***
## health         0.14099    0.08613   1.637 0.101637    
## absences       0.08975    0.02315   3.877 0.000106 ***
## famsupyes     -0.23892    0.24092  -0.992 0.321337    
## activitiesyes -0.29591    0.23480  -1.260 0.207580    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 429.59  on 365  degrees of freedom
## AIC: 439.59
## 
## Number of Fisher Scoring iterations: 4
coef(m)
##   (Intercept)        health      absences     famsupyes activitiesyes 
##   -1.48807647    0.14098501    0.08974717   -0.23892472   -0.29590959
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                      OR     2.5 %    97.5 %
## (Intercept)   0.2258066 0.1027826 0.4768908
## health        1.1514074 0.9747528 1.3673371
## absences      1.0938977 1.0475837 1.1472586
## famsupyes     0.7874742 0.4913683 1.2657321
## activitiesyes 0.7438547 0.4683710 1.1776531

As we can see from the regression coefficients, family support and extracurricular activities are negatively associated with high alcohol use, while good current health status and absences are positively associated with it. The only statistically significant result is the connection between high alcohol use and absences.

The knowledge I have of the data is not enough to make clear causal interpretations, although it might be tempting to claim that family support and extracurricular activities lead to less alcohol consumption.

It is also not very surprising that absences and high alcohol consumption are positively correlated, but it is more difficult to say which one is the cause and which the effect (or whether some third variable causes both).

To me, the most surprising result is that high alcohol consumption and good current health status are positively correlated. It might be that the possible health hazards of high alcohol use do not show yet, because the respondents are so young. I would still have expected high alcohol use to be connected with worse mental health even among young people, so I guess it comes down to how health status is measured. It might also be that the youngsters with bad health do not want to risk it further by using alcohol.
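To read one of the odds ratios above in concrete terms (a sketch assuming the fitted model object m from above): exponentiating a coefficient gives the multiplicative change in the odds of high use per one-unit change in that predictor.

```r
# each additional absence multiplies the odds of high alcohol use by exp(beta)
exp(coef(m)["absences"])  # about 1.09, i.e. roughly a 9% increase in the odds per absence
```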

6. Using the variables which, according to my regression model, had a statistical relationship with high/low alcohol consumption, I will explore the predictive power of my model. I will provide a 2x2 cross-tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions.

# fit the model
m <- glm(high_use ~ health + absences + famsup + activities , data = joined_data, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'joined_data'
joined_data <- mutate(joined_data, probability = probabilities)
# use the probabilities to make a prediction of high_use
joined_data <- mutate(joined_data, prediction = probability > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(joined_data, sex, age, goout, address, high_use, probability, prediction) %>% tail(10)
##     sex age goout address high_use probability prediction
## 361   M  18     3       R     TRUE   0.3924820      FALSE
## 362   M  18     3       R     TRUE   0.2090412      FALSE
## 363   M  18     1       R     TRUE   0.3535075      FALSE
## 364   M  18     4       U     TRUE   0.2417650      FALSE
## 365   M  18     2       U    FALSE   0.3109075      FALSE
## 366   M  18     3       U     TRUE   0.3304539      FALSE
## 367   M  18     2       U    FALSE   0.3136411      FALSE
## 368   M  19     4       R     TRUE   0.3400464      FALSE
## 369   M  19     4       R     TRUE   0.3803344      FALSE
## 370   M  19     2       R    FALSE   0.3136411      FALSE
# tabulate the target variable versus the predictions
table(high_use = joined_data$high_use, prediction = joined_data$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   251    8
##    TRUE     98   13
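The cross-tabulation above can be condensed into a single number, the training error, i.e. the proportion of incorrectly classified individuals. This is a sketch assuming the prediction column created above; from the table, the error is (8 + 98) / 370, about 0.29.

```r
# proportion of incorrectly classified individuals (training error)
mean(joined_data$high_use != joined_data$prediction)
```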

Chapter 4: Clustering and classification

1. First, I created the Rmd file for Chapter 4 and added it as a child file in index.Rmd.

2. I will load the Boston dataset from the MASS package and describe it briefly.

# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The Boston dataset contains housing values in the suburbs of Boston. It includes 506 observations of 14 variables. The 14 variables in the dataset are:

crim: per capita crime rate by town
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
nox: nitrogen oxides concentration (parts per 10 million)
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted mean of distances to five Boston employment centres
rad: index of accessibility to radial highways
tax: full-value property-tax rate per $10,000
ptratio: pupil-teacher ratio by town
black: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
lstat: lower status of the population (percent)
medv: median value of owner-occupied homes in $1000s

3. I will show a graphical overview of the data and show summaries of the variables in the data.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
#summaries are interpreted along with appropriate graphics of all the continuous variables
for (v in c("crim", "zn", "indus", "nox", "rm", "age",
            "dis", "tax", "ptratio", "black", "lstat", "medv")) {
  boxplot(Boston[[v]], main = v)
}

From the summaries and boxplots we get an overview of the variables. For example, the crime rate (crim) is heavily right-skewed (mean 3.61 vs. median 0.26), and for rad the third quartile equals the maximum (24), meaning that at least a quarter of the towns share the highest accessibility index.

library(corrplot)
## corrplot 0.92 loaded
library(tidyr)

# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)

# print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

I am interested in education, so I want to take a closer look at how the pupil-teacher ratio by town (ptratio) is correlated with the other variables. It seems that there is a somewhat strong positive correlation with accessibility to radial highways (rad, 0.46), the property-tax rate (tax, 0.46), the proportion of non-retail business acres (indus, 0.38) and the share of lower-status population (lstat, 0.37),

and a negative correlation with the median value of owner-occupied homes (medv, -0.51), residential zoning (zn, -0.39) and the average number of rooms per dwelling (rm, -0.36).

An interesting observation is that a high pupil-teacher ratio is connected to a higher proportion of lower-status population and a lower median value of owner-occupied homes.
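The correlations of ptratio with all the other variables can also be pulled straight out of the matrix (a sketch assuming the cor_matrix object computed above):

```r
# correlations of the pupil-teacher ratio with the other variables, sorted
sort(cor_matrix["ptratio", ])
```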

4. Next, I will standardize the dataset and print out summaries of the scaled data. As we can see, standardization sets the mean of each variable to 0 (and the standard deviation to 1).

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
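As a quick check (a sketch using the Boston data loaded above), scale() subtracts each column's mean and divides by its standard deviation, so the scaled crime rate can be reproduced by hand:

```r
# manual standardization should match the output of scale()
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.numeric(boston_scaled$crim), manual_crim)  # TRUE
```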

Next, let’s create a categorical variable of the crime rate in the Boston dataset using the scaled crime rate and quantiles as the break points. Let’s also drop the old crime rate variable from the dataset and divide the dataset to train and test sets, so that 80% of the data belongs to the train set.

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

#Dividing the dataset to train and test sets, so that 80% of the data belongs to the train set

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)
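One caveat: sample() is not seeded here, so the split — and every result downstream of it — changes each time the document is knitted. A sketch of the same idea with a seed added (the seed value 2021 is an arbitrary choice of mine):

```r
# reproducible 80/20 split (sketch; seed value is arbitrary)
set.seed(2021)
n   <- 100                       # stands in for nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
length(ind)                      # 80 rows end up in the train set
```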

5. Next, I will fit a linear discriminant analysis on the train set, using the categorical crime rate as the target variable and all the other variables in the dataset as predictors. I will then draw the LDA (bi)plot.

# MASS and train are available
library(MASS)

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2450495 0.2599010 0.2599010 0.2351485 
## 
## Group means:
##                  zn      indus         chas        nox         rm        age
## low       0.9796776 -0.8625334 -0.113254311 -0.8634158  0.4863375 -0.8638831
## med_low  -0.1054295 -0.2827728 -0.009855719 -0.5742493 -0.1528789 -0.2936310
## med_high -0.3712677  0.1454989  0.177625245  0.4423399  0.0875555  0.4227982
## high     -0.4872402  1.0172655 -0.023670105  1.0738156 -0.4453914  0.8396654
##                 dis        rad        tax     ptratio       black       lstat
## low       0.8081900 -0.6756133 -0.7315164 -0.44836468  0.37104224 -0.77309795
## med_low   0.3932677 -0.5443600 -0.4774927 -0.07272163  0.31182651 -0.09929382
## med_high -0.3683448 -0.4306070 -0.3262755 -0.35866306  0.04736163  0.04336872
## high     -0.8834537  1.6366336  1.5129868  0.77903654 -0.79370730  0.88907465
##                 medv
## low       0.55238311
## med_low  -0.03069772
## med_high  0.15145071
## high     -0.65846425
## 
## Coefficients of linear discriminants:
##                   LD1         LD2         LD3
## zn       0.0665514002  0.71216047 -0.96893776
## indus    0.0155367525 -0.08537207  0.33217548
## chas    -0.0818298307 -0.03066944  0.10485043
## nox      0.3125394529 -0.93009260 -1.22188774
## rm      -0.1265743716 -0.08446652 -0.20248068
## age      0.2345835463 -0.32422584  0.03955617
## dis     -0.0731981597 -0.36170603  0.46178413
## rad      3.4138108959  0.89104221 -0.01321704
## tax     -0.0007020999  0.07072678  0.45985265
## ptratio  0.1285410336 -0.06966325 -0.24972945
## black   -0.1309585474  0.05175419  0.09136577
## lstat    0.2327346341 -0.24564549  0.34172422
## medv     0.2254075463 -0.41219481 -0.09069186
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9480 0.0375 0.0145
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

6. The crime categories from the test set were already saved and the categorical crime variable removed above. Next, I will predict the classes with the LDA model on the test data and cross tabulate the results with the correct crime categories.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       17      10        1    0
##   med_low    5      12        4    0
##   med_high   0       6       13    2
##   high       0       0        0   32

The LDA model fitted on 80% of the original data predicts the categories of the crime variable quite well. The prediction is best for the high category, which is classified correctly in every case, while the neighbouring lower categories are confused with each other more often.
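One way to sharpen "quite well" is to compute the overall accuracy from the cross-tabulation. A sketch that simply re-enters the counts printed above by hand (remember that the unseeded split means these counts vary between knits):

```r
# overall accuracy from the confusion matrix printed above
conf <- matrix(c(17, 10,  1,  0,
                  5, 12,  4,  0,
                  0,  6, 13,  2,
                  0,  0,  0, 32),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
sum(diag(conf)) / sum(conf)  # 74 / 102, about 0.73
```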

7. Next, I will reload the Boston dataset and standardize it. Then I will calculate the distances between the observations and run the k-means algorithm on the dataset.

I will also investigate the optimal number of clusters and run the algorithm again. I will visualize the clusters with the pairs() and ggpairs() functions, where the clusters are separated with colors, and interpret the results.

# load the data again
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
dim(Boston)
## [1] 506  14
#standardizing the data
boston_scaled2 <- scale(Boston)
boston_scaled2 <- as.data.frame(boston_scaled2)
summary(boston_scaled2)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# euclidean distance matrix
dist_eu <- dist(boston_scaled2)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled2, method = 'manhattan')

# look at the summary of the distances
summary(dist_man) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
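The two metrics differ in how coordinate-wise differences are accumulated, which is why the Manhattan summaries above are uniformly larger. A two-point sketch:

```r
# Euclidean vs. Manhattan distance between the points (0, 0) and (3, 4)
p <- rbind(c(0, 0), c(3, 4))
dist(p)                        # sqrt(3^2 + 4^2) = 5
dist(p, method = "manhattan")  # |3| + |4| = 7
```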
# k-means clustering
km <- kmeans(boston_scaled2, centers = 3) 

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

# set a seed so the k-means runs below are reproducible
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares on the standardized data
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})


# visualize the results

library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering with the chosen number of clusters, again on the standardized data
km <- kmeans(boston_scaled2, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
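Beyond the pairs plot, the fitted kmeans object itself is worth inspecting. A self-contained sketch on toy data (the same accessors apply to the km object above):

```r
# sketch: inspecting a k-means fit
set.seed(1)
toy <- matrix(rnorm(100), ncol = 2)   # 50 observations, 2 variables
km2 <- kmeans(toy, centers = 2)
km2$size          # number of observations per cluster
km2$centers       # cluster centroids
km2$tot.withinss  # total within-cluster sum of squares (the elbow criterion)
```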


Chapter5: Dimensionality reduction techniques

1. First, I will show a graphical overview of the data and summaries of the variables in the data.

#human <- read.csv("data/human.csv", header=TRUE)
#summary(human)

human_ <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep=",", header=TRUE)
summary(human_)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(dplyr)
library (corrplot)
# visualize the 'human_' variables
ggpairs(human_)

# compute the correlation matrix and visualize it with corrplot
cor(human_) %>% corrplot

There are large differences in the variables between different countries. Here are some observations:

There is a positive correlation between expected education and life expectancy, and also a positive correlation between adolescent birth rate and maternal mortality.

There is a negative correlation between maternal mortality and life expectancy, between maternal mortality and expected education, between adolescent birth rate and life expectancy, and between adolescent birth rate and expected education.

We can note that higher female secondary education is positively correlated with things like expected education, life expectancy and GNI, and negatively correlated with adolescent birth rate and maternal mortality.
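The pairwise claims above can also be pulled out programmatically instead of read off the corrplot. A sketch on toy data (replacing m with cor(human_) reproduces the observations above):

```r
# sketch: listing the strongest pairwise correlations from a correlation matrix
set.seed(42)
d <- data.frame(x = rnorm(50))
d$y <- d$x + rnorm(50, sd = 0.3)    # strongly positive correlation with x
d$z <- -d$x + rnorm(50, sd = 0.3)   # strongly negative correlation with x
m  <- cor(d)
cc <- as.data.frame(as.table(m))    # long format: Var1, Var2, Freq
cc <- subset(cc, as.integer(Var1) < as.integer(Var2))  # upper triangle only
cc[order(-abs(cc$Freq)), ]          # sorted by correlation strength
```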

2. Next, I will perform principal component analysis (PCA) on the non-standardized human data. I will show the variability captured by the principal components and draw a biplot displaying the observations by the first two principal components (PC1 coordinate on the x-axis, PC2 coordinate on the y-axis), along with arrows representing the original variables.

# print out summaries of the human_ variables
summary(human_)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

3. Next, I will standardize the variables in the human_ data and repeat the above analysis.

Standardization sets the mean of each variable to zero and its standard deviation to one, so the PCA is no longer dominated by the variables with the largest absolute variance (most notably GNI), which makes for a very different biplot.

# standardize the variables
human_std <- scale(human_)

# print out summaries of the standardized variables
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
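The variability captured by the components — promised in part 2 but never actually printed — is available from summary(); on the real data the call is just summary(pca_human). A self-contained sketch:

```r
# sketch: proportion of variance captured by each principal component
set.seed(7)
toy <- scale(matrix(rnorm(200), ncol = 4))   # 50 observations, 4 standardized variables
pca <- prcomp(toy)
s   <- summary(pca)
s$importance["Proportion of Variance", ]     # one share per PC, summing to one
```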

4. Based on the biplot drawn after PCA on the standardized human data, we can note that the variables now contribute on a comparable scale, so the first two components reflect the correlation structure of the data rather than the raw variances.

5. I will load the tea dataset from the package FactoMineR and explore the data briefly.

# access the Factominer, ggplot2, dplyr and tidyr packages
library(FactoMineR)
library(ggplot2)
library(dplyr)
library(tidyr)

# load the tea data and explore it
data("tea")
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 
#The tea dataset contains 36 variables and 300 observations

# visualize the dataset
boxplot(tea[, 1:18], main = 'Multiple Box plots')

boxplot(tea$age)

boxplot(tea[, 20:36], main = 'Multiple Box plots')

Next, I will create a new dataset “tea_time”, perform Multiple Correspondence Analysis (MCA) on it and comment on the outputs.

# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))

# look at the summaries and structure of the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"))

plot(mca, invisible=c("var"))

# adjusting the plot by adding the argument habillage = "quali", which colors the categories by variable and makes the visualization easier to interpret
plot(mca, invisible=c("ind"), habillage = "quali")

# by default (graph = TRUE), MCA() also draws the plots of the analysis
mca <- MCA(tea_time)
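A small sanity check on the MCA summary above: the ‘% of var.’ row is each eigenvalue divided by the total inertia, which for these 6 variables with 17 categories in total is (17 - 6)/6 ≈ 1.833:

```r
# eigenvalues as printed in the MCA summary above
eig <- c(0.279, 0.261, 0.219, 0.189, 0.177, 0.156,
         0.144, 0.141, 0.117, 0.087, 0.062)
round(100 * eig / sum(eig), 1)  # matches the '% of var.' row up to rounding
```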

(more chapters to be added similarly as we proceed with the course!)